feat(keypoint-detection): add COCO OKS-AP evaluation #949
Open
jeon185 wants to merge 1 commit into
Adds the eval stage for keypoint-detection (ViTPose), completing config -> build -> perf -> eval for the COCO-keypoint models in #284.

- metrics/keypoint.py: KeypointAPMetric computes the COCO keypoint score (OKS-based AP over 0.50:0.95) via pycocotools COCOeval, the same way object-detection reuses the COCO mAP protocol.
- keypoint_detection_evaluator.py: top-down evaluator. transformers has no keypoint-detection pipeline, so it drives the image processor and ONNX model directly - per ground-truth person box it runs preprocess -> model -> post_process_pose_estimation and scores against GT keypoints. ViTPose exports a static batch of 1, so each person crop runs separately and the heatmaps are stacked for post-processing. Uses GT person boxes (standard COCO top-down, isolates pose accuracy from detection).
- scripts/build_coco_keypoints.py: builds a local COCO val keypoints dataset; downloads annotations once and fetches images individually so a subset does not need the full image zip.
- Schema, evaluator registry, default dataset, unit tests.

Verified on the five COCO 17-keypoint models (vitpose-base-simple, vitpose-plus-{small,base,large,huge}): config -> build -> perf -> eval all pass and return COCO AP/AR.

synthpose-vitpose-huge-hf is not covered yet. It predicts 52 anatomical keypoints rather than COCO's 17, so it can't be scored against COCO ground truth - the keypoint sets don't line up, and OKS is only defined when they do. Right now the metric detects this mismatch and raises a clear error instead of failing deep inside pycocotools. KeypointAPMetric already takes sigmas and keypoint_names as arguments, so supporting SynthPose mainly needs a dataset with its 52-keypoint ground truth plus the matching OKS sigmas; I'd rather confirm the dataset/sigmas choice in review before adding that. Open to suggestions on whether to land it here or as a follow-up.

Refs #284.
Adds the eval stage for keypoint-detection (ViTPose), so the COCO-keypoint models from #284 now go through config -> build -> perf -> eval. Stacked on #905 (the config/build/perf enablement) - that one should go in first.

What's here:
- `metrics/keypoint.py` - KeypointAPMetric. Computes the COCO keypoint score (OKS-based AP over 0.50:0.95) with pycocotools COCOeval, the same way object-detection already reuses the COCO mAP protocol.
- `keypoint_detection_evaluator.py` - top-down evaluator. transformers has no keypoint-detection pipeline, so it runs the image processor and ONNX model directly: for each ground-truth person box it does preprocess -> model -> post_process_pose_estimation and scores against the GT keypoints. ViTPose is exported with a static batch of 1, so each person crop runs separately and the heatmaps are stacked back together for post-processing. It uses the GT person boxes (standard COCO top-down protocol - keeps the score about pose accuracy, not detection).
- `scripts/build_coco_keypoints.py` - builds a local COCO val keypoints dataset. COCO has no script-free HF mirror for person keypoints, so this downloads the annotations once and fetches images individually, which means a small subset doesn't need the full image zip.

Verified on the five COCO 17-keypoint models (vitpose-base-simple and vitpose-plus-{small,base,large,huge}): config -> build -> perf -> eval all pass and return COCO AP/AR. AP rises with model size as you'd expect. Absolute numbers are on the low side right now because the build quantizes with random calibration data, but relative comparison holds.
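For reviewers who want the batch-of-1 handling spelled out, here is a minimal sketch of the per-box loop described above. It is not the code in this PR: `run_per_box` and `run_model` are hypothetical names, and the shapes in the comments are assumptions made for illustration only.

```python
import numpy as np


def run_per_box(crops, run_model):
    """Run a static-batch-1 model once per person crop and stack the results.

    crops:     sequence of arrays shaped (C, H, W), one per ground-truth box
               (assumed already preprocessed; must contain at least one crop).
    run_model: hypothetical callable taking a (1, C, H, W) array and returning
               heatmaps shaped (1, K, h, w).
    Returns an array shaped (N, K, h, w) for N crops.
    """
    heatmaps = []
    for crop in crops:
        batch = crop[np.newaxis, ...]   # add the fixed batch dimension of 1
        out = run_model(batch)          # one inference call per person crop
        heatmaps.append(out[0])         # drop the batch dimension again
    # Stack so post-processing can treat all crops as a single batch.
    return np.stack(heatmaps, axis=0)
```

The stacked array is what a batched post-processing step would consume; how the real evaluator feeds it into post_process_pose_estimation is not shown here.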
synthpose-vitpose-huge-hf - not covered yet
This is the one model from #284 that this PR does not evaluate. It predicts 52 anatomical keypoints instead of COCO's 17, so it can't be scored against COCO ground truth - the keypoint sets don't line up and OKS is only defined when they do.
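To make the "only defined when they line up" point concrete, here is a rough pure-Python sketch of OKS for a single prediction/ground-truth pair. It follows the COCO definition as I understand pycocotools to implement it (per-keypoint constant `2 * sigma`, object area as the scale term). The function name `oks` is made up for this example, and the case with no labelled keypoints is simplified to return 0.0, which is not what pycocotools does.

```python
import math

# Standard COCO sigmas for the 17 person keypoints.
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred_xy, gt_xyv, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity for one predicted / ground-truth person.

    pred_xy: list of (x, y) predictions, one per keypoint.
    gt_xyv:  list of (x, y, v) ground truth; v > 0 means the keypoint is labelled.
    area:    ground-truth object area in pixels (the scale term s^2).
    """
    # Each keypoint index must mean the same joint in all three lists.
    if not (len(pred_xy) == len(gt_xyv) == len(sigmas)):
        raise ValueError(
            f"keypoint count mismatch: {len(pred_xy)} predicted, "
            f"{len(gt_xyv)} ground truth, {len(sigmas)} sigmas"
        )
    total, labelled = 0.0, 0
    for (px, py), (gx, gy, v), sigma in zip(pred_xy, gt_xyv, sigmas):
        if v <= 0:
            continue  # unlabelled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * area * k2))
        labelled += 1
    # Simplification: pycocotools handles the no-labelled-keypoints case differently.
    return total / labelled if labelled else 0.0
```

Because each sigma is tied to a specific joint by index, a 52-keypoint prediction has no meaningful pairing with 17 COCO ground-truth points.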
How it's handled for now: the metric checks the keypoint count up front and raises a clear, actionable error instead of failing with a numpy broadcast error deep inside pycocotools.
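As an illustration of the kind of up-front check described above (not the actual code in metrics/keypoint.py), a guard along these lines would fail fast with a readable message. The function name `check_keypoint_count` and the error wording are hypothetical.

```python
def check_keypoint_count(num_predicted, keypoint_names, sigmas):
    """Fail early if the model's keypoints cannot be scored against the dataset.

    num_predicted:  number of keypoints the model outputs per person.
    keypoint_names: keypoint names defined by the ground-truth dataset.
    sigmas:         per-keypoint OKS sigmas, one per name.
    """
    expected = len(keypoint_names)
    if len(sigmas) != expected:
        raise ValueError(
            f"got {len(sigmas)} OKS sigmas for {expected} keypoint names; "
            "they must match one-to-one"
        )
    if num_predicted != expected:
        raise ValueError(
            f"model predicts {num_predicted} keypoints but the dataset defines "
            f"{expected}; OKS needs matching keypoint sets, so use a dataset "
            "and sigmas that match the model"
        )
```

Raising before any scoring starts keeps the failure close to its cause instead of surfacing later as a shape error.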
Idea for finishing it: KeypointAPMetric already takes `sigmas` and `keypoint_names` as arguments, so the main missing piece is a dataset with SynthPose's 52-keypoint ground truth plus the matching OKS sigmas. I'd rather agree on the dataset and sigmas in review before adding that - happy to land it in this PR or as a follow-up, whichever you prefer.

Refs #284.